{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n\nAuto-tuning a Convolutional Network for x86 CPU\n===============================================\n**Author**: `Yao Wang <https://github.com/kevinthesun>`_, `Eddie Yan <https://github.com/eqy>`_\n\nThis is a tutorial about how to tune convolution neural network\nfor x86 CPU.\n\nNote that this tutorial will not run on Windows or recent versions of macOS. To\nget it to run, you will need to wrap the body of this tutorial in a :code:`if\n__name__ == \"__main__\":` block.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import os\nimport numpy as np\n\nimport tvm\nfrom tvm import relay, autotvm\nfrom tvm.relay import testing\nfrom tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner\nfrom tvm.autotvm.graph_tuner import DPTuner, PBQPTuner\nimport tvm.contrib.graph_executor as runtime"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Define network\n--------------\nFirst we need to define the network in relay frontend API.\nWe can either load some pre-defined network from :code:`relay.testing`\nor building :any:`relay.testing.resnet` with relay.\nWe can also load models from MXNet, ONNX and TensorFlow.\n\nIn this tutorial, we choose resnet-18 as tuning example.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def get_network(name, batch_size):\n    \"\"\"Get the symbol definition and random weight of a network\"\"\"\n    input_shape = (batch_size, 3, 224, 224)\n    output_shape = (batch_size, 1000)\n\n    if \"resnet\" in name:\n        n_layer = int(name.split(\"-\")[1])\n        mod, params = relay.testing.resnet.get_workload(\n            num_layers=n_layer, batch_size=batch_size, dtype=dtype\n        )\n    elif \"vgg\" in name:\n        n_layer = int(name.split(\"-\")[1])\n        mod, params = relay.testing.vgg.get_workload(\n            num_layers=n_layer, batch_size=batch_size, dtype=dtype\n        )\n    elif name == \"mobilenet\":\n        mod, params = relay.testing.mobilenet.get_workload(batch_size=batch_size, dtype=dtype)\n    elif name == \"squeezenet_v1.1\":\n        mod, params = relay.testing.squeezenet.get_workload(\n            batch_size=batch_size, version=\"1.1\", dtype=dtype\n        )\n    elif name == \"inception_v3\":\n        input_shape = (batch_size, 3, 299, 299)\n        mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)\n    elif name == \"mxnet\":\n        # an example for mxnet model\n        from mxnet.gluon.model_zoo.vision import get_model\n\n        block = get_model(\"resnet18_v1\", pretrained=True)\n        mod, params = relay.frontend.from_mxnet(block, shape={input_name: input_shape}, dtype=dtype)\n        net = mod[\"main\"]\n        net = relay.Function(\n            net.params, relay.nn.softmax(net.body), None, net.type_params, net.attrs\n        )\n        mod = tvm.IRModule.from_expr(net)\n    else:\n        raise ValueError(\"Unsupported network: \" + name)\n\n    return mod, params, input_shape, output_shape\n\n\n# Replace \"llvm\" with the correct target of your CPU.\n# For example, for AWS EC2 c5 instance with Intel Xeon\n# Platinum 8000 series, the target should be \"llvm -mcpu=skylake-avx512\".\n# For AWS EC2 c4 instance with Intel Xeon E5-2666 v3, it should be\n# \"llvm -mcpu=core-avx2\".\ntarget = \"llvm\"\n\nbatch_size = 1\ndtype = \"float32\"\nmodel_name = \"resnet-18\"\nlog_file = \"%s.log\" % model_name\ngraph_opt_sch_file = \"%s_graph_opt.log\" % model_name\n\n# Set the input name of the graph\n# For ONNX models, it is typically \"0\".\ninput_name = \"data\"\n\n# Set number of threads used for tuning based on the number of\n# physical CPU cores on your machine.\nnum_threads = 1\nos.environ[\"TVM_NUM_THREADS\"] = str(num_threads)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Configure tensor tuning settings and create tasks\n-------------------------------------------------\nTo get better kernel execution performance on x86 CPU,\nwe need to change data layout of convolution kernel from\n\"NCHW\" to \"NCHWc\". To deal with this situation, we define\nconv2d_NCHWc operator in topi. We will tune this operator\ninstead of plain conv2d.\n\nWe will use local mode for tuning configuration. RPC tracker\nmode can be setup similarly to the approach in\n`tune_relay_arm` tutorial.\n\nTo perform a precise measurement, we should repeat the measurement several\ntimes and use the average of results. In addition, we need to flush the cache\nfor the weight tensors between repeated measurements. This can make the measured\nlatency of one operator closer to its actual latency during end-to-end inference.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "tuning_option = {\n    \"log_filename\": log_file,\n    \"tuner\": \"random\",\n    \"early_stopping\": None,\n    \"measure_option\": autotvm.measure_option(\n        builder=autotvm.LocalBuilder(),\n        runner=autotvm.LocalRunner(\n            number=1, repeat=10, min_repeat_ms=0, enable_cpu_cache_flush=True\n        ),\n    ),\n}\n\n\n# You can skip the implementation of this function for this tutorial.\ndef tune_kernels(\n    tasks, measure_option, tuner=\"gridsearch\", early_stopping=None, log_filename=\"tuning.log\"\n):\n\n    for i, task in enumerate(tasks):\n        prefix = \"[Task %2d/%2d] \" % (i + 1, len(tasks))\n\n        # create tuner\n        if tuner == \"xgb\" or tuner == \"xgb-rank\":\n            tuner_obj = XGBTuner(task, loss_type=\"rank\")\n        elif tuner == \"ga\":\n            tuner_obj = GATuner(task, pop_size=50)\n        elif tuner == \"random\":\n            tuner_obj = RandomTuner(task)\n        elif tuner == \"gridsearch\":\n            tuner_obj = GridSearchTuner(task)\n        else:\n            raise ValueError(\"Invalid tuner: \" + tuner)\n\n        # do tuning\n        n_trial = len(task.config_space)\n        tuner_obj.tune(\n            n_trial=n_trial,\n            early_stopping=early_stopping,\n            measure_option=measure_option,\n            callbacks=[\n                autotvm.callback.progress_bar(n_trial, prefix=prefix),\n                autotvm.callback.log_to_file(log_filename),\n            ],\n        )\n\n\n# Use graph tuner to achieve graph level optimal schedules\n# Set use_DP=False if it takes too long to finish.\ndef tune_graph(graph, dshape, records, opt_sch_file, use_DP=True):\n    target_op = [\n        relay.op.get(\"nn.conv2d\"),\n    ]\n    Tuner = DPTuner if use_DP else PBQPTuner\n    executor = Tuner(graph, {input_name: dshape}, records, target_op, target)\n    executor.benchmark_layout_transform(min_exec_num=2000)\n    executor.run()\n    executor.write_opt_sch2record_file(opt_sch_file)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Finally, we launch tuning jobs and evaluate the end-to-end performance.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def evaluate_performance(lib, data_shape):\n    # upload parameters to device\n    dev = tvm.cpu()\n    data_tvm = tvm.nd.array((np.random.uniform(size=data_shape)).astype(dtype))\n    module = runtime.GraphModule(lib[\"default\"](dev))\n    module.set_input(input_name, data_tvm)\n\n    # evaluate\n    print(\"Evaluate inference time cost...\")\n    print(module.benchmark(dev, number=100, repeat=3))\n\n\ndef tune_and_evaluate(tuning_opt):\n    # extract workloads from relay program\n    print(\"Extract tasks...\")\n    mod, params, data_shape, out_shape = get_network(model_name, batch_size)\n    tasks = autotvm.task.extract_from_program(\n        mod[\"main\"], target=target, params=params, ops=(relay.op.get(\"nn.conv2d\"),)\n    )\n\n    # run tuning tasks\n    tune_kernels(tasks, **tuning_opt)\n    tune_graph(mod[\"main\"], data_shape, log_file, graph_opt_sch_file)\n\n    # compile kernels in default mode\n    print(\"Evaluation of the network compiled in 'default' mode without auto tune:\")\n    with tvm.transform.PassContext(opt_level=3):\n        print(\"Compile...\")\n        lib = relay.build(mod, target=target, params=params)\n        evaluate_performance(lib, data_shape)\n\n    # compile kernels in kernel tuned only mode\n    print(\"\\nEvaluation of the network been tuned on kernel level:\")\n    with autotvm.apply_history_best(log_file):\n        print(\"Compile...\")\n        with tvm.transform.PassContext(opt_level=3):\n            lib = relay.build(mod, target=target, params=params)\n        evaluate_performance(lib, data_shape)\n\n    # compile kernels with graph-level best records\n    print(\"\\nEvaluation of the network been tuned on graph level:\")\n    with autotvm.apply_graph_best(graph_opt_sch_file):\n        print(\"Compile...\")\n        with tvm.transform.PassContext(opt_level=3):\n            lib = relay.build_module.build(mod, target=target, params=params)\n        evaluate_performance(lib, data_shape)\n\n\n# We do not run the tuning in our webpage server since it takes too long.\n# Uncomment the following line to run it by yourself.\n\n# tune_and_evaluate(tuning_option)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Sample Output\n-------------\nThe tuning needs to compile many programs and extract feature from them.\nSo a high performance CPU is recommended.\nOne sample output is listed below.\n\n.. code-block:: bash\n\n   Extract tasks...\n   Tuning...\n   [Task  1/12]  Current/Best:  598.05/2497.63 GFLOPS | Progress: (252/252) | 1357.95 s Done.\n   [Task  2/12]  Current/Best:  522.63/2279.24 GFLOPS | Progress: (784/784) | 3989.60 s Done.\n   [Task  3/12]  Current/Best:  447.33/1927.69 GFLOPS | Progress: (784/784) | 3869.14 s Done.\n   [Task  4/12]  Current/Best:  481.11/1912.34 GFLOPS | Progress: (672/672) | 3274.25 s Done.\n   [Task  5/12]  Current/Best:  414.09/1598.45 GFLOPS | Progress: (672/672) | 2720.78 s Done.\n   [Task  6/12]  Current/Best:  508.96/2273.20 GFLOPS | Progress: (768/768) | 3718.75 s Done.\n   [Task  7/12]  Current/Best:  469.14/1955.79 GFLOPS | Progress: (576/576) | 2665.67 s Done.\n   [Task  8/12]  Current/Best:  230.91/1658.97 GFLOPS | Progress: (576/576) | 2435.01 s Done.\n   [Task  9/12]  Current/Best:  487.75/2295.19 GFLOPS | Progress: (648/648) | 3009.95 s Done.\n   [Task 10/12]  Current/Best:  182.33/1734.45 GFLOPS | Progress: (360/360) | 1755.06 s Done.\n   [Task 11/12]  Current/Best:  372.18/1745.15 GFLOPS | Progress: (360/360) | 1684.50 s Done.\n   [Task 12/12]  Current/Best:  215.34/2271.11 GFLOPS | Progress: (400/400) | 2128.74 s Done.\n   INFO Start to benchmark layout transformation...\n   INFO Benchmarking layout transformation successful.\n   INFO Start to run dynamic programming algorithm...\n   INFO Start forward pass...\n   INFO Finished forward pass.\n   INFO Start backward pass...\n   INFO Finished backward pass...\n   INFO Finished DPExecutor run.\n   INFO Writing optimal schedules to resnet-18_graph_opt.log successfully.\n\n   Evaluation of the network compiled in 'default' mode without auto tune:\n   Compile...\n   Evaluate inference time cost...\n   Mean inference time (std dev): 4.5 ms (0.03 ms)\n\n   Evaluation of the network been tuned on kernel level:\n   Compile...\n   Evaluate inference time cost...\n   Mean inference time (std dev): 3.2 ms (0.03 ms)\n\n   Evaluation of the network been tuned on graph level:\n   Compile...\n   Config for target=llvm -keys=cpu -link-params=0, workload=('dense_nopack.x86', ('TENSOR', (1, 512), 'float32'), ('TENSOR', (1000, 512), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.\n   Config for target=llvm -keys=cpu -link-params=0, workload=('dense_pack.x86', ('TENSOR', (1, 512), 'float32'), ('TENSOR', (1000, 512), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.\n   Evaluate inference time cost...\n   Mean inference time (std dev): 3.16 ms (0.03 ms)\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}